Session 3 covered:
R markdown
Writing functions
Apply (again)
Installing packages
Using packages
Real life example: DESeq2
# Functions
multiply = function(x, y) return (x * y)
# Default arguments
multiply = function(x, y=2) return (x * y)
# Scopes
multiply = function(x, y) z <<- x * y
# Passing through arguments
apply(m, 1, multiply, y=2)
# Packages from CRAN
install.packages("ggplot2")
# Packages from Bioconductor
library(BiocManager)
install("DESeq2")
What did you learn?
Do we need to recap any parts?
Session 4
What makes a good figure?
Using colour
viridis colour library
Plotting with ggplot2
Humans are visual creatures!
Visualisation of large data sets is an essential task in molecular biology and medicine. When done effectively, images can help you to explain the most complex of data.
Without an image or summarisation, how well would you do finding significant genes passing a given fold change threshold in a table like this?
##                    baseMean log2FoldChange      lfcSE       pvalue         padj
## PENG0000000001    1.1941808    -0.11077611 0.46558089 1.822729e-01 1.000000e+00
## PENG0000000002    0.1394510    -0.01310322 0.42899817 8.004145e-01 1.000000e+00
## PENG0000000004    0.7393672    -0.01877427 0.39858386 7.883606e-01 1.000000e+00
## PENG0000000006   44.3468742     0.08682659 0.19961999 4.041232e-01 6.352165e-01
## PENG0000000007  788.9630161     1.65624481 0.20412453 2.062683e-18 1.994904e-16
## PENG0000000009 1345.8691348    -0.15601837 0.13576046 9.532748e-02 2.327564e-01
## PENG0000000010    4.4468600     0.04479898 0.29043812 6.594170e-01 1.000000e+00
## PENG0000000012  671.1614707    -0.14329193 0.10518787 7.279070e-02 1.885944e-01
## PENG0000000014    0.1394510    -0.01310322 0.42899817 8.004145e-01 1.000000e+00
## PENG0000000015  770.7712490     0.70623622 0.15143560 1.487255e-07 2.841074e-06
## PENG0000000016    5.3302466     0.14048684 0.33883106 2.315969e-01 1.000000e+00
## PENG0000000022 1276.9182728    -0.06159326 0.07899988 3.184644e-01 5.180787e-01
There are a few key points you need to consider when deciding upon a method of visualisation:
relevance: the message a figure needs to convey
salience: how easily the eye can distinguish your message from the background
accuracy: how exactly different visualisation methods may convey your message
Especially when presenting (when the audience is listening and reading), it’s vital that salience and relevance are aligned.
Human vision is highly selective. We understand visual information by selecting, in turn, individual objects or aspects for detailed analysis rather than by appreciating an entire scene.
Formally, salience is the property of an object that sets it apart from its surroundings; it’s a relative property, therefore, and depends on the collection of objects being visualised.
We can enhance salience by manipulating color, shape, size, and position to focus attention.
It’s not inherently difficult to make information salient …
… what’s important is that the salient points of a figure align with what’s relevant.
Nevertheless, it’s also easy to reduce salience by:
displaying too much information together
attempting to convey different points of relevance within the same image
referring to points of relevance that are tangential to what’s salient
When displaying a graphic, we want the viewer to be able to perceive the patterns and trends that convey the relevant point.
However, humans are better able to interpret some visual cues than others, and interpretation is subjective; everyone’s different! Let’s try and rank the following methods of conveying the same information:
Research in the field of visual theory has shown that, in order, people are best able to understand:
positions on a common aligned scale
positions on unaligned scales that are otherwise common
lengths
angles and slopes
area
volume and colour saturation
colour hue
Unfortunately, although it’s often a go-to method, colour is amongst the least reliable methods of conveying information!
This is worsened by the fact that colour perception is relative. We distinguish colours differently depending on their surroundings and can be easily tricked into seeing the same colour differently or into seeing different colours similarly.
Not only can we be fooled into perceiving colours contextually, but people also differ in their ability to distinguish colours based on their genetics.
Colour blindness (colour vision deficiency or CVD) is a sliding scale and there are a number of different types. ‘Full’ phenotypes for the three more common forms of CVD are shown here.
Across populations with Northern European ancestry, up to 1/12 males and 1/200 females have some level of red-green CVD. UK-wide, 4.5% of the population have some level of CVD and, even by the time they leave school, approximately 40% are unaware.
When producing figures that make use of colour, if we want them to be salient, it’s important to be inclusive! Thankfully, a number of colour scales have been developed that allow figures to retain salience for those with CVD.
The viridis library provides color maps for use in
R that are:
colourful, spanning as wide a palette as possible
perceptually uniform, such that, across the whole range, nearby values have similar-appearing colors and distant values appear more distinct
friendly for those with colour vision deficiency
viridis can be installed from CRAN.
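As a minimal sketch, assuming viridis isn’t installed yet, installing it and taking a quick look at a palette might go something like this. The number of colours (5) and the choice of "magma" as a second palette are arbitrary.

```r
# Install once from CRAN (uncomment if viridis is not already installed)
# install.packages("viridis")

library(viridis)

# Generate five colours from the default viridis palette as hex codes
pal = viridis(5)
print(pal)

# Other palettes can be selected by name with the option= argument
pal_magma = viridis(5, option="magma")
print(pal_magma)
```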
Although base R can produce a variety of plots, getting
them to be ‘just so’ can be extremely difficult!
These days, virtually all ‘pretty’ plotting is done with the
ggplot2 library.
Given its popularity, lots of libraries interface with
ggplot2 (e.g. viridis). To complement
ggplot2, we’ll also install patchwork, which
helps to lay out plots in rows and grids.
The central ggplot() function provides a consistent
interface to map data to the aesthetics of
geometries.
In other words:
the tabular data that we wish to plot …
… has various facets, which we map to the aesthetic properties we can perceive (position, colour, shape, size, and transparency) …
… of various geometric methods of displaying the data (bars, points, lines, etc)
Most commonly, we make a plot by passing the ggplot()
function two arguments:
data, which takes a data frame (or something that
can be coerced to one)
mapping, which uses the aes() function
to define the aesthetic mappings we’d like to use
penguins = na.omit(read.csv("data/1_palmerpenguins.csv"))
penguins$species = factor(penguins$species)
penguins$island = factor(penguins$island)
penguins$sex = factor(penguins$sex)
penguins$year = factor(penguins$year, ordered=TRUE)
library(ggplot2)
g = ggplot(penguins, aes(x=bill_length_mm, y=bill_depth_mm))
class(g)
## [1] "gg" "ggplot"
Here, we’ve instantiated a ggplot object with the
penguins data frame and defined two simple aesthetics -
that the x axis should display bill_length_mm
and that y should display bill_depth_mm.
Let’s have a look at what we’ve made.
Beautiful, we’re finished!
Remember that a ggplot requires data, aesthetics, and
geometries. So far, we haven’t added any geometries to our plot. We can
think of a ggplot as a painting; ggplot()
provides the canvas and the geometries provide the layers of paint.
Geometries (geom_...() functions) added to a
ggplot display data according to the aesthetics defined in
the ggplot object. Geometries are added (literally, as we
use the + operator) to the ggplot and we
reassign the result to the original object variable name.
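A minimal sketch of this step is below. It uses a small simulated stand-in for the penguins data frame so that it runs on its own; `toy` and its values are made up for illustration and are not the course data.

```r
library(ggplot2)

# Simulated stand-in for the penguins data (values are illustrative only)
set.seed(1)
toy = data.frame(
  bill_length_mm = rnorm(30, mean=44, sd=5),
  bill_depth_mm  = rnorm(30, mean=17, sd=2)
)

# The canvas: data plus aesthetics, but no geometries yet
g = ggplot(toy, aes(x=bill_length_mm, y=bill_depth_mm))

# Add a layer of points and reassign to the same variable name
g = g + geom_point()

# Evaluating the object draws the plot
g
```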
Now we can see something!
What we have so far is a little simplistic! Let’s re-work the
aesthetics by getting the ggplot object to colour the data according to
the levels of the species factor (color and
col are also valid if you’re not British).
The aesthetic properties defined in the ggplot()
function call are set as the defaults for all geometries, provided
they’re applicable. All geometries added will be passed these
defaults.
Here, the linear model (method="lm") trend lines we add
using geom_smooth() inherit their colour
aesthetic from the ggplot parent object.
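The inheritance described above can be sketched as follows, again with a simulated stand-in for the penguins data (`toy` and its values are assumptions for illustration only).

```r
library(ggplot2)

# Simulated stand-in for the penguins data (values are illustrative only)
set.seed(1)
toy = data.frame(
  species        = factor(rep(c("Adelie", "Chinstrap", "Gentoo"), each=10)),
  bill_length_mm = rnorm(30, mean=rep(c(39, 49, 47), each=10), sd=2),
  bill_depth_mm  = rnorm(30, mean=rep(c(18, 18, 15), each=10), sd=1)
)

# colour=species is set in ggplot(), so it becomes a default for every layer
g = ggplot(toy, aes(x=bill_length_mm, y=bill_depth_mm, colour=species))

# Both the points and the linear-model trend lines inherit the colour mapping,
# giving one trend line per species
g = g + geom_point() + geom_smooth(method="lm")
g
```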
Let’s make our first ggplot!
Set up a ggplot object displaying
flipper_length_mm against body_mass_g
Add a colour aesthetic to the plot
Add a geom_smooth(method="lm") geometry to the
plot
What happens if we exchange geom_point() for
geom_density2d()?
The more factors we map to aesthetics, the more we partition the data.
g = ggplot(penguins, aes(x=bill_length_mm, y=bill_depth_mm, colour=species, shape=sex))
g +
geom_point() +
geom_smooth(method="lm")
Here, the shape=sex aesthetic further subdivided the
data by sex, giving 6 trend lines instead of 3.
We therefore don’t always want our geometries to inherit all of the defaults
passed to the ggplot() function. Thankfully:
individual geometries can have alternate aesthetics passed using
the mapping= argument, which allows individual aesthetic
parameters to be reset to NULL.
geom_point(mapping=aes(shape=NULL))
aesthetics can be manually specified for all data by passing the aesthetic parameter as an argument itself
geom_point(colour="black")
Here, we pass alpha=0.5 to geom_point() to
manually set the alpha (transparency) of the points. Additionally, we
override the shape aesthetic within
geom_smooth() by setting it to NULL so that we
don’t duplicate our trend lines.
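A sketch of what that might look like, assuming a simulated stand-in for the penguins data (`toy`, `p`, and the values are hypothetical; the two geometry calls follow the description above):

```r
library(ggplot2)

# Simulated stand-in for the penguins data (values are illustrative only)
set.seed(1)
toy = data.frame(
  species        = factor(rep(c("Adelie", "Chinstrap", "Gentoo"), each=20)),
  sex            = factor(rep(c("female", "male"), times=30)),
  bill_length_mm = rnorm(60, mean=rep(c(39, 49, 47), each=20), sd=2),
  bill_depth_mm  = rnorm(60, mean=rep(c(18, 18, 15), each=20), sd=1)
)

g = ggplot(toy, aes(x=bill_length_mm, y=bill_depth_mm, colour=species, shape=sex))

p = g +
  # alpha is set manually, outside aes(), so it applies to every point
  geom_point(alpha=0.5) +
  # shape is reset to NULL for this layer only, so the trend lines are
  # grouped by species alone (3 lines rather than 6)
  geom_smooth(mapping=aes(shape=NULL), method="lm")
p
```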
Depending on the geometry, different aesthetic options are more relevant than others. Some aesthetics are only available for specific geometries.
As a general guide, discrete variables are best visually separated with …
shape for very small numbers of discrete groups and
where overlapping is minimal
linetype for very small numbers of discrete
groups
colour or fill for filled geometries
(where colour alters the border) using separate
hues
… whereas continuous variables are most accurately displayed using:
size or linewidth
alpha
saturation of a single colour
Separately to the aesthetics of the geometries, we can control other
visual aspects of the plot by modifying the theme()
defaults or by using one of the theme_...() presets. We can
also use the labs() function to control the axis and plot
titles.
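As a rough sketch, a preset theme and some labels could be added like this. The data are a simulated stand-in for the penguins data frame, and the title and axis label text are made up for illustration.

```r
library(ggplot2)

# Simulated stand-in for the penguins data (values are illustrative only)
set.seed(1)
toy = data.frame(
  species        = factor(rep(c("Adelie", "Chinstrap", "Gentoo"), each=10)),
  bill_length_mm = rnorm(30, mean=rep(c(39, 49, 47), each=10), sd=2),
  bill_depth_mm  = rnorm(30, mean=rep(c(18, 18, 15), each=10), sd=1)
)

g = ggplot(toy, aes(x=bill_length_mm, y=bill_depth_mm, colour=species)) +
  geom_point() +
  # Plot, axis, and legend titles
  labs(
    title="Bill dimensions by species",
    x="Bill length (mm)",
    y="Bill depth (mm)",
    colour="Species"
  ) +
  # A preset theme replaces the default grey background
  theme_bw()
g
```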
Let’s add some style to our ggplot.
Update your previous plot to use a theme preset. Try out
theme_bw(), theme_classic(), and
theme_minimal().
Even when using a theme preset, we can still override specific
elements using theme(). What does adding
theme(axis.text.y=element_text(angle=90, vjust=0.5, hjust=0.5))
achieve?
How might we rotate the labels for the x axis?
Looking at the help for element_text(), how might we
change the size of the plot.title element?
There are many geometries available to help display data of different formats. The majority of graphing applications fall into five groups:
single continuous variable: geom_freqpoly(),
geom_histogram(), geom_area(),
geom_density()
single discrete variable: geom_bar()
two continuous variables: geom_point(),
geom_smooth(), geom_rug(),
geom_density_2d()
two variables, one continuous and one discrete:
geom_boxplot(), geom_violin(),
geom_dotplot(), geom_jitter(),
geom_col()
two discrete variables: geom_count(),
geom_jitter()
Let’s see a few options for plotting a continuous against a discrete variable.
Different geometries can be used to highlight - make salient -
different aspects of the data. Here, geom_boxplot() better
shows the position of the median value, whereas
geom_violin() better highlights the spread of the data and
geom_jitter() might do that too much!
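The three geometries compared above could be drawn from a shared base along these lines. The choice of body_mass_g against species is an assumption, and `toy` is a simulated stand-in for the penguins data.

```r
library(ggplot2)

# Simulated stand-in for the penguins data (values are illustrative only)
set.seed(1)
toy = data.frame(
  species     = factor(rep(c("Adelie", "Chinstrap", "Gentoo"), each=20)),
  body_mass_g = rnorm(60, mean=rep(c(3700, 3700, 5000), each=20), sd=400)
)

# A shared base: one discrete and one continuous variable
base = ggplot(toy, aes(x=species, y=body_mass_g))

# Median and quartiles are the salient features
p_box = base + geom_boxplot()

# The shape of the distribution is the salient feature
p_violin = base + geom_violin()

# Every observation is shown, spread horizontally to reduce overlap
p_jitter = base + geom_jitter(width=0.2)

p_box
p_violin
p_jitter
```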
Let’s see a few options for plotting two continuous variables against each other.
Here, as an alternative to geom_point(),
geom_density2d() may better highlight the ‘centre of mass’
for each group. geom_rug() might be suitable alongside
another geometry but otherwise lacks both accuracy and saliency.
Using this base …
… let’s compare a couple of options for plotting a single continuous variable.
Add a geom_histogram() to the plot
Switch that for a geom_freqpoly()
Which plot has better accuracy and saliency?
Let’s apply some of our knowledge about using colour effectively … and inclusively!
The viridis library integrates easily with
ggplot2 using various
scale_colour_viridis_...() functions:
scale_colour_viridis_d() makes salient mappings of
discrete variables
scale_colour_viridis_c() makes accurately
distinguishable mappings for continuous variables
scale_colour_viridis_b() merges the two - continuous
variables are binned to enhance salience
Let’s start by replacing ggplot2’s default colour scheme
for discrete variables using the scale_colour_viridis_d()
(d for discrete) function.
ggplot(penguins, aes(x=body_mass_g, y=flipper_length_mm, colour=species)) +
geom_point() +
theme_bw() +
scale_colour_viridis_d()
Pretty easy!
As we saw earlier, viridis has a variety of palettes.
Let’s compare how good they are at visually discriminating discrete data
by passing the option= argument to the
scale_colour_viridis_d() function.
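A sketch of that comparison is below. It assumes a simulated stand-in for the penguins data (`toy`), and "magma" and "plasma" are just two of the available palette names picked for illustration.

```r
library(ggplot2)

# Simulated stand-in for the penguins data (values are illustrative only)
set.seed(1)
toy = data.frame(
  species           = factor(rep(c("Adelie", "Chinstrap", "Gentoo"), each=20)),
  body_mass_g       = rnorm(60, mean=rep(c(3700, 3700, 5000), each=20), sd=400),
  flipper_length_mm = rnorm(60, mean=rep(c(190, 196, 217), each=20), sd=6)
)

base = ggplot(toy, aes(x=body_mass_g, y=flipper_length_mm, colour=species)) +
  geom_point() +
  theme_bw()

# The default palette
p_viridis = base + scale_colour_viridis_d()

# Alternative palettes, selected by name with option=
p_magma  = base + scale_colour_viridis_d(option="magma")
p_plasma = base + scale_colour_viridis_d(option="plasma")

p_viridis
p_magma
p_plasma
```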
There are more penguin-related homework tasks to help cement what we’ve covered today!
The homework and instructions can be found within the main directory
for the course: ./homework/Homework_4.Rmd